ABSTRACT

The method of buffering packets in a digital communications device
includes defining an n-level hierarchy of memory partitions, wherein
each non-bottom level partition has one or more child partitions at an
immediately lower level of the hierarchy. The memory partitions at the
top-most level are pre-configured with a target memory occupancy size,
and the target occupancy for each memory partition situated at a lower
level is recursively computed in a dynamic manner based on the
aggregate congestion of its parent partition, until the target
occupancies for the bottom-most memory partitions are determined. Each
traffic flow which uses the buffer is associated with one of the memory
partitions at the bottom-most level of the hierarchy and packet discard
is enabled in the event the actual memory occupancy of a traffic flow
exceeds its target occupancy. The memory partitions at higher levels
are preferably associated with a set of traffic flows, such as traffic
flows associated with a particular egress port and class of service, to
thereby selectively control aggregate congestion. Traffic flow sets may
also be defined in respect of adaptive flows such as TCP flows which
decrease their transmission rates in response to congestion
notification, and non-adaptive flows such as UDP flows which do not
decrease their transmission rates. Random early detection (RED) is
applied to such traffic flows based on the target occupancy of the
corresponding memory partition. The method is expected to improve
network performance, allow full buffer sharing, permit the weighted
distribution of buffer space within a memory partition, and scale
easily to large systems.

18 Claims, 13 Drawing Sheets 
DYNAMIC BUFFERING SYSTEM HAVING
INTEGRATED RANDOM EARLY DETECTION

FIELD OF THE INVENTION

The invention generally relates to a method and system for buffering
data packets at a queuing point in a digital communications device such
as a network node. More particularly the invention relates to a system
for achieving a fair distribution of buffer space between adaptive
flows of traffic, the sources of which decrease their transmission rate
in response to congestion notification, and non-adaptive flows of
traffic, the sources of which do not alter their transmission rate in
response to congestion notification.

BACKGROUND OF THE INVENTION

In order to effect statistical multiplexing in a store and forward
digital communications device, such devices will typically queue data
packets for subsequent processing or transmission in a common storage
resource such as a memory buffer. At such a gateway or queuing point,
the common storage resource may be shared by traffic flows associated
with various classes of service, interface ports, or some other common
attributes which define an aggregation of the most granular traffic
flows. With traffic of such a multi-faceted nature, sophisticated
communication devices need some type of congestion control system in
order to ensure that the common storage resource is "fairly" allocated
amongst the various traffic flows.

INTERNET ROUTERS

For example, in an Internet router, the transport level protocol may be
some form of TCP (Transmission Control Protocol) or UDP (User Datagram
Protocol). The datagrams or packets of such protocols are somewhat
different and hence may be used to define different traffic flows.
Within each of these protocols the packets may be associated with one
of several possible classes or qualities of service which may further
define the traffic flows at a hierarchically lower level of aggregation
or higher level of granularity. (A number of quality of service schemes
for the Internet are currently being proposed by various
standard-setting bodies and other organizations, including the
Integrated Service/RSVP model, the Differentiated Services (DS) model,
and Multi-Protocol Label Switching (MPLS), and the reader is referred
to Xiao and Lee, Internet QoS: A Big Picture, Department of Computer
Science, Michigan State University,
<http://www.cse.msu.edu/~xiaoxipe/researchLink.html>, Sep. 9, 1999, for
an overview of these schemes.) Still more granular traffic flows may be
defined by packets which share some common attributes such as
originating from a particular source and/or addressed to a particular
destination, including at the most granular levels packets associated
with a particular application transmitted between two end-users.

In an IP router the memory buffer at any given gateway or queuing point
may be organized into a plural number of queues which may, for example,
hold packets in aggregate for one of the classes of service.
Alternatively, each queue may be dedicated to a more granular traffic
flow. Regardless of the queuing structure, when the memory buffer
becomes congested, it is often desirable to apportion its use amongst
traffic flows in order to ensure the fair distribution of the buffer
space. The distribution may be desired to be effected at one or more
different levels of aggregation, such as memory partitionment between
interface ports, and between classes of service associated with any
given interface port.

One typically implemented buffer management scheme designed to minimize
buffer congestion of TCP/IP flows is the random early detection (RED)
algorithm. Under RED, packets are randomly dropped in order to cause
different traffic flow sources to reduce their transmission rates at
different times. This prevents buffers from overflowing and causing
packets to be dropped simultaneously from multiple sources. Such
behaviour, if unchecked, leads to multiple TCP sources simultaneously
lowering and then increasing their transmission rates, which can cause
serious oscillations in the utilization of the network and
significantly impede its performance. RED also avoids a bias against
bursty traffic since, during congestion, the probability of dropping a
packet for a particular flow is roughly proportional to that flow's
share of the bandwidth. For further details concerning RED, see Floyd
and Jacobson, Random Early Detection Gateways for Congestion Avoidance,
1993 IEEE/ACM Transactions on Networking.

However, it has been shown that RED does not always fairly allocate
buffer space or bandwidth amongst traffic flows. This is caused by the
fact that at any given time RED imposes the same loss rate on all
flows, regardless of their bandwidths. Thus, RED may accidentally drop
packets from the same connection, causing temporary non-uniform
dropping among identical flows. In addition, RED does not fairly
allocate bandwidth when a mixture of non-adaptive and adaptive flows
such as UDP and TCP flows share link resources. TCP is an adaptive flow
because the packet transmission rate for any given flow depends on its
congestion window size which in turn varies markedly with packet loss
(as identified by non-receipt of a corresponding acknowledgement within
a predetermined time-out period). UDP flows are non-adaptive because
their packet transmission rates are independent of loss rate. Thus,
unless UDP sources are controlled through a fair discard mechanism,
they compete unfairly with TCP sources for buffer space and bandwidth.
See more particularly Lin and Morris, Dynamics of Random Early
Detection, Proceedings of SIGCOMM'97.

A variant of the RED algorithm that has been proposed to overcome these
problems is the Flow Random Early Drop (FRED) algorithm introduced by
Lin and Morris, supra. However, one drawback of FRED is the large
number of state variables that needs to be maintained for providing
isolation between adaptive and non-adaptive flows. This can prove
problematic for high capacity, high speed routers, and other solutions
are sought.

ATM SWITCH

In an asynchronous transfer mode (ATM) communication system, the most
granular traffic flow (from the ATM perspective) is a virtual
connection (VC) which may belong to one of a number of different types
of quality of service categories. The ATM Forum Traffic Management
working group has defined five (5) service categories, which are
distinguished by the parameter sets which describe source behaviour and
quality of service (QoS) guarantees. These categories include constant
bit rate (CBR), real time variable bit rate (rtVBR), non-real time
variable bit rate (nrtVBR), available bit rate (ABR), and unspecified
bit rate (UBR) service categories. The ABR and UBR service categories
are intended to carry data traffic which has no specific cell loss or
delay guarantees. UBR service does not specify traffic related
guarantees while ABR service attempts to provide a minimum useable
bandwidth,
designated as a minimum cell rate (MCR). The ATM Forum Traffic
Management working group and International Telecommunications Union
(ITU) have also proposed a new service category, referred to as
guaranteed frame rate (GFR). GFR is intended to provide service similar
to UBR but with a guaranteed minimum useable bandwidth at the frame or
packet level.

In an ATM device such as a network switch the memory buffer at any
given queuing point may be organized into a plural number of queues
which may hold packets in aggregate for VCs associated with one of the
service categories. Alternatively, each queue may be dedicated to a
particular VC. Regardless of the queuing structure, each VC can be
considered as a traffic flow and groups of VCs, spanning one or more
queues, can also be considered as a traffic flow defined at a
hierarchically higher level of aggregation or lower level of
granularity. For instance, a group of VCs associated with a particular
service class or input/output port may define a traffic flow. When the
memory buffer becomes congested, it may be desirable to apportion its
use amongst service categories, and amongst various traffic flows
thereof at various levels of granularity. For instance, in a network
where GFR and ABR connections are contending for buffer space, it may
be desired to achieve a fair distribution of the memory buffer between
these service categories and between the individual VCs thereof.

The problem of providing fair allocation of buffer space between
adaptive and non-adaptive flows also exists in ATM systems. With the
introduction of IP over ATM, VCs may carry one or more IP flows, where
each IP flow can be adaptive or non-adaptive. Thus, some VCs may be
adaptive in nature, others may be non-adaptive in nature, while still
others may be mixed. A fair allocation of buffer space between such VCs
is desired.

A number of prior art fair buffer allocation (FBA) schemes configured
for ATM systems are known. One such scheme is to selectively discard
packets based on policing. For an example of this scheme in an ATM
environment, a packet (or more particularly, "cell" as a data packet is
commonly referred to at the ATM layer) is tagged (i.e., its CLP field
is set to 1) if the corresponding connection exceeds its MCR, and when
congestion occurs, discard priority is given to packets having a cell
loss priority (CLP) field set to zero over packets having a CLP field
set to one. See ATM Forum Technical Committee, "Traffic Management
working group living list", ATM Forum, btd-tm-01.02, July 1998. This
scheme, however, fails to fairly distribute unused buffer space between
connections.

Another known scheme is based on multiple buffer fill level thresholds
where a shared buffer is partitioned with these thresholds. In this
scheme, packet discard occurs when the queue occupancy crosses one of
the thresholds and the connection has exceeded its fair share of the
buffer. The fair buffer share of a connection is calculated based on
the MCR value of the connection and the sum of the MCRs of all active
connections utilizing the shared buffer. However, this technique does
not provide an MCR proportional share of the buffer because idle (i.e.,
allocated but not used) buffer, which can be defined as

$$\sum_{i=1}^{N} \max\left(0,\; \frac{MCR_i}{\sum_{active} MCR}\, Q_s - Q_i\right)$$

where $Q_s$ is the buffer fill level, $Q_i$ is the buffer segment count
for a connection $i$, and $\frac{MCR_i}{\sum_{active} MCR}\, Q_s$ is the
fair share of buffer allocated to the connection, is distributed at
random between the connections.

Another scheme is based on dynamic per-VC thresholds. See Choudhury and
Hahne, "Dynamic Queue Length Thresholds in a Shared Memory ATM Switch",
Proceedings of I.E.E.E. Infocom 96, March 1996, pages 679 to 686. In
this scheme the threshold associated with each VC is periodically
upgraded based on the unused buffer space and the MCR value of a
connection. Packet discard occurs when the VC occupancy is greater than
the VC threshold. This method reserves buffer space to prevent
overflows. The amount of reserved buffer space depends on the number of
active connections. When there is only one active connection, the
buffer is not fully utilized, i.e., full buffer sharing is not allowed.

In conclusion, some of the above-mentioned prior art does not fairly
distribute buffer space or idle buffer space between traffic flows.
Other prior art buffer management schemes also do not allow for full
buffer sharing. Another drawback with some prior art buffer management
schemes is that they do not address the allocation of buffer space to
contending traffic flows defined at multiple levels of
aggregation/granularity. The invention seeks to overcome or alleviate
some or all of these and other prior art limitations.

In what follows, unless the context dictates otherwise, the term
"traffic flow" refers to the most-granular flow of packets defined in a
buffer management system. Designers may use their discretion to define
the most-granular flow. The term "traffic flow set" refers to an
aggregation or grouping of the most-granular traffic flows. In the
context of the present invention, a traffic flow set may also consist
of a single traffic flow. Thus a traffic flow set as understood herein
comprises one or more traffic flows.

SUMMARY OF THE INVENTION

Broadly speaking, one aspect of the invention relates to a method of
processing packets at a queuing point in a communications device having
a shared memory buffer. The method includes receiving and associating
packets with one of a plurality of traffic flow sets. These sets are
defined so as to logically contain either adaptive traffic flows or
non-adaptive traffic flows, but not both. Each traffic flow set is
associated with a target memory occupancy size which is dynamically
computed in accordance with a pre-determined dynamic fair buffer
allocation scheme, such as a preferred recursive fair buffer allocation
method described herein. When one of the traffic flow sets is in a
congested state, packets associated therewith are discarded. Congestion
is preferably deemed to occur when the actual memory occupancy of a
given traffic flow set reaches the target occupancy size thereof. In
addition, packets are randomly discarded for at least the traffic flow
sets containing adaptive traffic flows, or alternatively all traffic
flow sets, prior to the sets becoming congested. The probability of
packet discard within a given traffic flow set is related to the target
memory occupancy size thereof. This is preferably subject to the
constraint that the probability of packet discard for a given traffic
flow set is zero if the target memory occupancy size thereof is below a
threshold value (indicative of a relatively non-congested buffer), and
reaches one when the given traffic flow set is congested.
The foregoing enables a buffering system operating in accordance with
the method to obtain the benefits of random early detection or random
early discard since sources of traffic are randomly notified of
impending congestion, thereby preventing serious oscillations of
network utilization. Some of the drawbacks of the prior art are also
avoided since the method ensures that no sources, especially
non-adaptive traffic flow sources, consume excessive buffer space due
to the fluctuating transmission rates of the adaptive traffic flows.
This is due to the logical isolation between adaptive and non-adaptive
traffic flows and the fair discard policy enforced by the buffer
allocation scheme. Furthermore, unlike the prior art, the probability
of packet discard is not static but rather dynamic in that it is based
on the dynamic target occupancy size. This enables the buffer to be
utilized to the maximum extent possible under the selected fair buffer
allocation scheme.
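
As an illustration of a discard probability that follows the dynamic target, the sketch below assumes a linear ramp: zero while the measured occupancy of a traffic flow set is below a fraction of its current target, rising to one when the occupancy reaches the target. The ramp shape, the `start_fraction` parameter and the function name are assumptions made for illustration only; the text above requires only that the probability be tied to the dynamically computed target occupancy size.

```python
def early_discard_probability(occupancy: float,
                              target: float,
                              start_fraction: float = 0.5) -> float:
    """Random early discard probability for one traffic flow set.

    occupancy      -- measured memory held by the set (e.g. in cells)
    target         -- current, dynamically computed target occupancy
    start_fraction -- hypothetical fraction of the target below which
                      no random discard takes place
    """
    if target <= 0:
        # No space is granted to the set at all: treat it as congested.
        return 1.0
    start = start_fraction * target
    if occupancy <= start:
        return 0.0
    if occupancy >= target:
        return 1.0
    # Linear ramp between the start point and the target (an assumption).
    return (occupancy - start) / (target - start)
```

Because `target` is recomputed as congestion changes, the same measured occupancy can map to different discard probabilities over time, which is the sense in which the probability is dynamic rather than static.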

Potential fair buffer allocation schemes which can be 
employed by the method include those schemes described 
in: 

Choudhury and Hahne, "Dynamic Queue Length Thresholds
in a Shared Memory ATM Switch", ©1996 IEEE,
Ref. No. 0743-166X/96; and

both of which are incorporated herein by reference. 

In various embodiments described herein the method employs a novel fair
buffer allocation scheme disclosed in applicant's co-pending patent
application, U.S. Ser. No. 09/320,471 filed May 27, 1999, which is also
described in detail herein. In this scheme the memory buffer is
controlled by defining a hierarchy of memory partitions, including at
least a top level and a bottom level, wherein each non-bottom level
memory partition consists of one or more child memory partitions. The
size of each top-level memory partition is pre-determined, and a
nominal partition size for the child partitions of a given non-bottom
level memory partition is dynamically computed based on the congestion
of the given memory partition. The size of each child memory partition
is dynamically computed as a weighted amount of its nominal partition
size. These steps are iterated in order to dynamically determine the
size of each memory partition at each level of the hierarchy. The
memory partitions at the bottom-most level of the hierarchy represent
space allocated to the most granular traffic flows defined in the
system, and the size of each bottom-level partition represents a memory
occupancy threshold for such traffic flows.
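
The top-down iteration described above can be sketched as a small tree walk. This is only a sketch under stated assumptions: the `Partition` class, its field names and the rule in `nominal_child_size` (nominal size equals the parent's unused target space) are hypothetical stand-ins, since the text specifies only that the nominal size depends on the parent's congestion and that each child receives a weighted amount of it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Partition:
    name: str
    weight: float = 1.0      # multiplier applied to the parent's nominal size
    occupancy: float = 0.0   # measured memory in use by this partition
    target: float = 0.0      # partition size (pre-set at the top level)
    children: List["Partition"] = field(default_factory=list)


def nominal_child_size(parent: Partition) -> float:
    """Hypothetical congestion rule: the fuller the parent is relative to
    its own size, the smaller the nominal size offered to its children."""
    return max(0.0, parent.target - parent.occupancy)


def assign_targets(node: Partition) -> None:
    """Walk the hierarchy from the top level down, sizing every child as a
    weighted amount of the nominal size computed for its parent."""
    nominal = nominal_child_size(node)
    for child in node.children:
        child.target = child.weight * nominal
        assign_targets(child)
```

With a top-level partition of size 1000 holding 600 units, a single port partition of weight 1.0 holding 100 units, and two flows of weight 1.0 and 0.5, the walk sizes the port at 400 and the flows at 300 and 150; the bottom-level sizes then act as per-flow occupancy thresholds.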

The memory partitions are preferably "soft" as opposed to "hard"
partitions in that if the memory space occupied by packets associated
with a given partition exceeds the size of the partition then incoming
packets associated with that partition are not automatically discarded.
In the embodiments described herein, each memory partition represents
buffer space allocated to a set of traffic flows defined at a
particular level of granularity. For instance, a third level memory
partition may be provisioned in respect of all packets associated with
a particular egress port, and a more granular second level memory
partition may be associated with a subset of those packets which belong
to a particular class of service. Therefore, the size of a given
partition can be viewed as a target memory occupancy size for the set
of traffic flows corresponding to the given partition. At the lowest
level of the hierarchy, however, the partition size functions as a
threshold on the amount of memory that may be occupied by the most
granular traffic flow defined in the system. When this threshold is
exceeded, packet discard is enabled. In this manner, aggregate
congestion at higher levels percolates down through the hierarchy to
affect the memory occupancy thresholds of the most granular traffic
flows. The net result is a fair distribution of buffer space between
traffic flow sets defined at each hierarchical level of aggregation or
granularity.

In the illustrative embodiments, one or more of the
memory partitions at any given hierarchical level is allocated
to adaptive traffic flows and non-adaptive traffic flows.
Packets associated with memory partitions at a pre- 
determined hierarchical level are randomly discarded prior 
to those partitions becoming congested, with the probability 
of discard being related to the size thereof. 

Another aspect of the invention relates to a method of 
buffering data packets. The method involves: 

(a) defining a hierarchy of traffic flow sets, the hierarchy 
including at least a top level and a bottom level, 
wherein each non-bottom level traffic flow set comprises
one or more child traffic flow subsets and
wherein at one non-bottom hierarchical level each set
within a group of traffic flow sets comprises either
adaptive flows or non-adaptive flows (but not both);

(b) provisioning a target memory occupancy size for each 
top-level traffic flow set; 

(c) dynamically determining a target memory occupancy 
size for each traffic flow set having a parent traffic flow 
set based on a congestion measure of the parent traffic 
flow set; 

(d) measuring the actual amount of memory occupied by 
the packets associated with each bottom level traffic 
flow; 

(e) enabling the discard of packets associated with a given 
bottom level traffic flow set in the event the actual 
memory occupancy size of the corresponding bottom 
level traffic flow exceeds the target memory occupancy 
size thereof thereby to relieve congestion; and 

(f) enabling packets associated with the traffic flow sets
containing adaptive flows to be randomly discarded 
prior to the step of discarding packets for congestion 
relief. 
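
Steps (d) to (f) can be pictured as a per-packet admission check. The sketch below is illustrative only: `FlowState`, `admit` and the injected `rng` callable are hypothetical names, and the early discard probability is taken as an input rather than derived, since the steps above do not fix how it is computed.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class FlowState:
    occupancy: float   # step (d): measured memory held by the flow
    target: float      # step (c): dynamically determined target occupancy
    adaptive: bool     # True if the flow belongs to an adaptive flow set


def admit(flow: FlowState,
          early_discard_probability: float,
          rng: Callable[[], float] = random.random) -> bool:
    """Return True if an arriving packet should be stored."""
    # Step (e): discard for congestion relief once the target is exceeded.
    if flow.occupancy > flow.target:
        return False
    # Step (f): before congestion sets in, randomly discard packets of
    # adaptive flows so that their sources slow down at different times.
    if flow.adaptive and rng() < early_discard_probability:
        return False
    return True
```

Passing `rng` explicitly keeps the decision reproducible in tests; in this sketch a non-adaptive flow is never subject to the random discard and is only limited by its target occupancy.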

In the embodiments described herein, the target memory 
occupancy size for a given traffic flow set is preferably 
computed by first computing a nominal target occupancy 
size for the child traffic flow sets of a common parent. The 
target memory occupancy size for each such child traffic 
flow set is then adjusted to a weighted amount of the 
nominal target occupancy size. The nominal target occu- 
pancy size for a given group of child traffic flow sets 
preferably changes in accordance with a pre-specified func- 
tion in response to the congestion of their common parent 
traffic flow set. In some of the embodiments described 
herein, congestion is defined as a disparity between the 
target and measured memory occupancy sizes of a parent 
traffic flow set, and geometric and decaying exponential 
functions are deployed for computing the nominal target 
occupancy size for the child sets thereof. 
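
A geometric update of the nominal target occupancy might look like the following sketch. The shrink and growth factors, the clamping bounds and both function names are assumptions chosen for illustration; the text states only that the nominal size changes according to a pre-specified function of the parent's congestion and that each child's target is a weighted amount of it.

```python
from typing import Dict


def update_nominal(nominal: float,
                   parent_occupancy: float,
                   parent_target: float,
                   floor: float,
                   ceiling: float,
                   shrink: float = 0.5,
                   grow: float = 2.0) -> float:
    """Geometrically adjust the nominal target shared by sibling sets.

    Congestion is read here as the parent's measured occupancy exceeding
    its target; the nominal size shrinks geometrically while that holds
    and grows geometrically otherwise, within [floor, ceiling].
    """
    if parent_occupancy > parent_target:
        nominal *= shrink
    else:
        nominal *= grow
    return min(max(nominal, floor), ceiling)


def child_targets(nominal: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Each child's target is a weighted amount of the nominal size."""
    return {name: weight * nominal for name, weight in weights.items()}
```

Run periodically for every parent in the hierarchy, an update of this kind needs only one nominal value per group of siblings, which is consistent with the small amount of per-level state the scheme relies on.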

The invention may be implemented within the context of 
an ATM communications system as disclosed herein. In 
these embodiments, the comparison specified in step (e) is 
preferably carried out prior to or upon reception of the first 
cell of an ATM adaptation layer (AAL) frame or packet in
order to effect early packet discard in accordance with the 
outcome of the comparison. 
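
The frame-level behaviour can be sketched as a small per-connection state machine: the occupancy comparison is made once, on the first cell of a frame, and its outcome is held for every remaining cell of that frame. The class and parameter names below are hypothetical, and the end-of-frame indication is passed in as a flag rather than decoded from a cell header.

```python
class EarlyPacketDiscard:
    """Per-connection early packet discard state (illustrative sketch)."""

    def __init__(self) -> None:
        self.dropping = False        # decision held for the current frame
        self.at_frame_start = True   # next cell begins a new frame

    def accept_cell(self, occupancy: float, target: float,
                    last_cell_of_frame: bool) -> bool:
        """Return True if the arriving cell should be stored."""
        if self.at_frame_start:
            # The comparison is made only on the first cell of a frame.
            self.dropping = occupancy > target
        accepted = not self.dropping
        # The cell after an end-of-frame cell starts the next frame.
        self.at_frame_start = last_cell_of_frame
        return accepted
```

Holding the decision for the whole frame avoids storing cells of a frame that can no longer be reassembled, and it also means a frame admitted at its first cell is not cut off midway in this sketch even if occupancy rises before its last cell arrives.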

In various embodiments disclosed herein, the bottom- 
level traffic flow sets are logically isolated so as to encom- 
pass either adaptive flows or non-adaptive flows, but not 
both. Random early discard is applied as discussed in greater 
detail below to at least the traffic flow sets at a pre-selected 
hierarchical level which contain adaptive flows, such as VCs
which carry TCP/IP traffic. Alternatively, random early
discard may be applied to all traffic flow sets at a pre-
selected hierarchical level. This may be desired if, for
instance, it is not known a priori which VC will be carrying
TCP/IP traffic and which will be carrying UDP traffic. In
either case, the probability of discard is preferably related to
the target memory occupancy size of the traffic flow sets at 
the pre-selected hierarchical level. 

The buffering system according to this aspect of the invention scales
well to large systems employing many hierarchical levels. This is
because there are relatively few state variables associated with each
hierarchical level. In addition, most computations may be performed in
the background and lookup tables may be used, thereby minimizing
processing requirements on time critical packet arrival. This system
also enables full buffer sharing, as discussed by way of an example in
greater detail below.

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing and other aspects of the invention will become more
apparent from the following description of specific embodiments thereof
and the accompanying drawings which illustrate, by way of example only,
the principles of the invention. In the drawings, where like elements
feature like reference numerals (and wherein individual elements in a
grouping of such like elements bear unique alphabetical suffixes):

FIG. 1 is a system block diagram of a conventional switch or router
architecture illustrating various queuing points therein;

FIG. 2 is a system block diagram of a buffering system according to a
first embodiment of the invention employed at one of the queuing points
shown in FIG. 1;

FIG. 3 is a Venn diagram showing how memory is hierarchically
partitioned in the first embodiment;

FIG. 4 is a diagram showing the hierarchical partitionment of the
memory in the first embodiment in tree form;

FIG. 5 is a system block diagram of a buffering system according to a
second embodiment of the invention;

FIG. 6 is a diagram showing, in tree form, how the memory in the second
embodiment is hierarchically partitioned;

FIG. 7 is a diagram showing, in tree form, an alternative approach to
the hierarchical partitionment of the memory in the second embodiment;

FIG. 8 is a system block diagram of a buffering system according to a
third embodiment of the invention;

FIG. 9 is a hardware block diagram of a portion of the buffering system
of the third embodiment;

FIG. 10 is a diagram showing, in tree form, how the memory of the third
embodiment is hierarchically partitioned;

FIG. 11 is a system block diagram of a buffering system which includes
random early detection, according to a fourth embodiment of the
invention;

FIG. 12 is a diagram showing, in tree form, how the memory of the
fourth embodiment is hierarchically partitioned;

FIGS. 13A-13C are diagrams showing changes to the hierarchical
partitionment of the memory in the fourth embodiment under various
conditions;

FIG. 14 is a system block diagram of a buffering system which includes
random early detection, according to a fifth embodiment of the
invention; and

FIG. 15 is a diagram showing, in tree form, how the memory of the fifth
embodiment is hierarchically partitioned.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

The detailed description is divided into three parts. First, the
discussion focuses on the preferred recursive fair buffer allocation
(FBA) system which provides full buffer sharing. A number of examples
of this system are presented. Next, the discussion relates to extending
the preferred FBA system in order to enable random early discard.
Finally, the discussion relates to alternative choices of FBA systems
which may be employed by the invention.

1. Recursive Fair Buffer Allocation (FBA) System

FIG. 1 is a diagram of the architecture of a conventional "Layer 2"
switch or "Layer 3" router designated by reference numeral 9 and
hereinafter referred to as a "node". The node 9 comprises a plurality
of ingress and egress line cards 11A and 11B for interfacing with the
network (not shown) via physical interface ports. Ingress line cards
11A are configured to receive packet traffic from the network via
ingress ports 19a and transmit packets to a switching core 13 via
egress ports 20a. The switching core 13, as is known in the art,
directs each packet to the appropriate egress line cards 11B. These
line cards are configured to receive packet traffic from the switching
core 13 via ingress ports 19b and transmit packets to the network via
egress ports 20b.

The line cards 11A and 11B as well as the switching core 13 are "store
and forward" devices and hence present a point within the node 9
wherein packets are queued in a memory or buffer for subsequent
processing by the device (hereinafter "queuing point"). At each queuing
point a buffer management system is provided as part of the store and
forward functionality.

FIG. 2 shows an example of a buffer management system 10(1) employed in
egress line card 11B. The system 10(1) comprises a common storage
resource such as a physical memory 12, portions of which are allocated,
as subsequently discussed, to various logical traffic flows 25 carried
by or multiplexed on aggregate input stream 18. A controller such as
queue management module (QMM) 24(1) organizes and manages the memory 12
according to a selected queuing scheme. In the illustrated embodiment,
for example, the QMM 24(1) employs an aggregate queuing scheme based on
service class and egress port. More specifically, the QMM 24(1)
organizes the memory 12 into multiple sets 15 of logical queues 17. In
each set 15 there preferably exists one queue for each service class of
the communication protocol. For instance, when applied to ATM
communications, each set 15 may comprise six (6) queues 17 in respect
of the CBR, rtVBR, nrtVBR, ABR, UBR, and GFR service classes.
Alternatively, the packets associated with two or more service classes
may be stored in a common queue in which case there may be less than a
1:1 relationship between queues and service classes. In any event, the
number of sets 15 preferably corresponds to the number of egress ports
20 of the line card 11B, with each set of queues holding packets
destined for the corresponding egress port.

Accordingly, as the ingress port 19 receives the packets of aggregate
input stream 18, the QMM 24(1) decides whether to store or discard a
given packet based on certain criteria described in greater detail
below. If a packet is destined to be stored, the QMM 24(1) reserves the
appropriate amount of memory, associates each packet with the
appropriate logical
queue 17, and stores the packet in the memory 12. In the illustrated
node, the function of matching an inbound packet to a given logical
queue 17 is based in part on header or address information carried by
the packet and stored connection configuration information, but it will
be understood that other node architecture may employ various other
mechanisms to provide this capability. Arbiters 22 each multiplex
packets from the logical queues 17 to their corresponding egress ports
20 according to a selected service scheduling scheme such as weighted
fair queuing (WFQ). When a queue/packet is serviced by one of the
arbiters 22, the corresponding memory block is freed, the QMM 24(1) is
notified as to which queue was serviced, and the packet is delivered to
the corresponding egress port 20 for transmission over an aggregate
output stream 21.

The respective Venn and tree diagrams of FIGS. 3 and 4 show how the
physical memory 12 may be partitioned in a hierarchical manner in
accordance with the queuing scheme described with reference to FIG. 2.
In this example, there are four levels in the hierarchical partitionment
of memory 12. At a first or top level, the memory is logically
partitioned into a shared buffer space 14(1) which occupies a subset
(less than or equal to) of the amount of fixed physical memory 12. The
excess memory space above the shared buffer space represents free
unallocated space. At a more granular second level, the memory space
allocated to the shared buffer 14(1) is partitioned amongst the various
egress ports 20b of line card 11B. At a still more granular third
level, the memory space allocated to each egress port is further
partitioned into service classes. At a fourth or bottom level, the
memory space allocated to each service class is further partitioned
amongst the most granularly defined traffic flows. In the case of ATM
communications, a suitable candidate for these traffic flows may be
individual VCs, as shown, such as

sets will not have any defined subset. Similarly, a traffic flow set
located at the top-most level of the hierarchy will not have a "parent"
set.

The memory partitions are "soft" as opposed to "hard" partitions,
meaning that if the memory space occupied by packets associated with a
given partition exceeds the size of the partition then the QMM 24(1)
does not automatically discard incoming packets associated with the
partition. Rather, the size of a given partition can be viewed as a
target memory occupancy size for the traffic flow set corresponding to
the partition. At the lowest level of the hierarchy, however, the
partition size functions as a threshold on the amount of memory that
may be occupied by the corresponding traffic flow. When this threshold
is exceeded, the QMM 24(1) enables packet discard. In ATM systems, the
QMM 24(1) may be configured to effect cell discard (i.e., at the ATM
layer), or to effect early frame or partial frame discard for frame
based traffic (i.e., at the AAL layer). In routers the QMM 24(1) may be
configured to effect complete or partial frame discard.

The size of each partition is generally variable and dynamically
determined by the QMM 24(1) in order to control the aggregate congestion
of the memory 12. More specifically, at each level of the hierarchy,
the aggregate congestion within a given parent memory partition is
controlled by computing a nominal partition size that can be applied to
each of its child partitions (which preferably, although not
necessarily, exists at the immediately next
virtual channel circuits (VCC) and virtual path circuits 35 lower level of the hierarchy). The value of the nominal 
(VPC), but in other types of communication protocols the partition size for the child partitions of a common parent can 
most granularly defined traffic flows may be selected by the be based on a number of factors such as the degree of 
commonality of various other types of attributes, such as congestion, its rale of change or even the mere existence or 
described above with reference to IP routers. non-existence of congestion vathin the parent partition. 

Ingeneral, at each level of the hierarchical partitionment 40 Specific examples are given below. Regardless of the 

of the inemory 12 other than at the bottom most level there function, tiiis process is recursively carried out throughout 

may exist one or more memory partitions. Each ^ch par- ^^^^^ ^ ^^^^ dynamicaUy determine the size for 

tition is further subdivided mto one or more partitions, . ^ . , , i-.. ./ l » t^- 

individuaUy referred to herein as a "child" partition, located ^"^^ P^*'°° ^^^^ ^^^^^ ^ hierarchy. In this manner, 

on a preferably, but not necessarily, immediately lower level 45 ^gS^^e^^e congesUon at higher levels percolate down 

of the hierarchy. In otiier words, one or more intermediate through the hierarchy to affect the memory occupancy 

levels of the hierarchical partitionment may be absent for thresholds for the most granulariy defined traffic flows, 
any one or more traffic flows represented in memory. At the 

bottom-most level of the hierarchy the memory partitions A second embodiment, implemented in software, is now 

are not fimher subdivided. Similarly, a partition located at 50 discussed in order to describe a specific algorithm for 

^^r"^ ^^'^''^^ ^ """^ ^""^^ ' computing the partition sizes. Referring additionally to 

P _. *. , . FIGS. 5 and 6, this more simplified embodiment is directed 

Since m Uie present apphcaUon each memory partiUon ^^^^ ^ single-port buffering subsystem 10<^) wherein the 

(e.g., shared buffer, ports classes, and VCs) repr^ents ^ ^ ^ partitioned into a shared memory buffer 14^) 

memory space noUonaUy allocated to a group or set of one 55 ^^^^ specifically for ATM ABR and UBR traffic. The 

or more traffic flows at vanous levels of granularity, there *^ . . . „ . « 

also exists a corresponding traffic flow hierarchy. For ^"'ammg portion of the memory 12 may be allocated to 

instance, in the embodiment shown in HGS. 3 and 4, one ^ categones, as described previously, or 

fourth level traffic flow set consists of an individual VC25fl, reserved for over-aUocation purposes. FIG. 6 shows the 

and one second level traffic flow set consists of a group of 60 hierarchical partitionment of the memory usmg a tree stnic- 

VCs 2S^^\ including VC 25a, associated with egress port ^^^^^ subsystem 10^^^ features only one egress port, no 

no. 1 (ref. no. 20a in FIG. 2). It wiU be understood from the provision has been made in this hierarchy for partitioning 

present example that a given traffic flow set consists of one memory amongst egress ports as in the previously 

or more traffic flow subsets, individually referred to herein discussed embodiment. Thus the hierarchical partitionment 

as a "child" set, preferably located on an immediately lower 65 of the memory 12 and the corresponding traffic flow hier- 

level of the hierarchy. The exception to this occurs at the archy features only three levels, namely shared buffer 14^^^, 

bottom-most level of the hierarchy wherein the traffic flow service classes 16, and VCs 25. 
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The following pseudo-code demonstrates the algorithm 
executed by the QMM 24^^^ of this embodiment. 

PSEUDO-CODE

VARIABLE DEFINITION:

Per Buffer

* TBS - A constant which provides a target size for the buffer, in units of cells.
* B_count - Counter for measuring the total number of cells stored in the buffer, thereby reflecting the amount of shared buffer currently being utilized.
* Last_B_count - A variable for holding the measure of the total number of cells stored in the buffer during a previous iteration.
* TSCS - A control variable which is used to set a target size (in terms of the number of cells) for a service class within the buffer. TSCS varies over time based on a disparity between TBS and B_count, as explained in greater detail below.
* FBS - A constant used to provide a lower bound on TSCS.
* D1, D2, D3 and D4 - Constants used to effect a geometric series or progression, as discussed in greater detail below.

Per Service Class

* SC_count[i] - Counter for measuring the number of cells in service class i, thereby reflecting the actual memory occupancy for the service class.
* Last_SC_count[i] - A variable for holding the measure of the total number of cells in service class i during a previous iteration.
* Wsc[i] - A constant used to specify a weight for service class i.
* TVCS[i] - A control variable which is used to set a target size for a connection within service class i. TVCS[i] varies over time based on a disparity between TSCS*Wsc[i] and SC_count[i], as explained in greater detail below.
* TCSmin and TCSmax - Constants used to apply minimum and maximum constraints on the value of TVCS[i].

Per Connection

* VC_count[i][j] - Counter for measuring the number of cells stored for connection j of service class i. (Note that the number of connections associated with each service class may vary and hence j may correspondingly have a different range for each value of i.)
* MCR[i][j] - Constant indicative of the MCR or weight of VC j of service class i.
* VCT[i][j] - Variable for the cell discard threshold for connection j of service class i. The cell discard threshold is proportional to the corresponding TVCS[i]; more specifically, VCT[i][j] = TVCS[i]*MCR[i][j].

INITIALIZATION:

(100) TSCS := TBS*FBS
(102) TVCS[i] := 1 ∀ i, i ∈ {1..N} where N is the number of service classes.

PERIODICALLY CALCULATE TSCS:

(104) if ( (B_count > TBS) & (B_count > Last_B_count) )
(106)     TSCS := TSCS*(1-D1)
(108) else if (B_count < TBS)
(110)     TSCS := TSCS/(1-D2)
(112) end if
(114) subject to constraint that TBS*FBS ≤ TSCS ≤ TBS
(116) Last_B_count := B_count

PERIODICALLY CALCULATE TVCS[i] (∀ i):

(118) if ( (SC_count[i] > TSCS*Wsc[i]) & (SC_count[i] > Last_SC_count[i]) )
(120)     TVCS[i] := TVCS[i]*(1-D3)
(122) else if (SC_count[i] < TSCS*Wsc[i])
(124)     TVCS[i] := TVCS[i]/(1-D4)
(126) end if
(128) subject to constraint that TCSmin ≤ TVCS[i] ≤ TCSmax
(130) Last_SC_count[i] := SC_count[i]

UPON CELL ARRIVAL FOR VC[i][j]:

(132) VCT[i][j] := TVCS[i] * MCR[i][j]
(134) if ( VC_count[i][j] > VCT[i][j] )
(136)     enable EPD
(138) end if


The algorithm involves dynamically computing a target memory occupancy size, i.e., memory partition size, for each traffic flow set. This is symbolized in FIG. 6 by the solid lines used to represent each entity. The actual amount of memory occupied by each traffic flow set is also measured by the algorithm and is symbolized in FIG. 6 by concentric stippled lines. Note that the actual size of memory occupied by any traffic flow set may be less than or greater than its target size.

The algorithm utilizes current and historical congestion information of a given memory partition/traffic flow set in order to determine the nominal target size for its child sets. Broadly speaking, the algorithm dynamically calculates for each traffic flow set:

(a) a target memory occupancy size, and 

(b) a control variable, which represents the nominal target 
memory occupancy size for the child sets of the present 
set. 

In the algorithm, which is recursive, the target memory occupancy size is calculated at step (a) for the present traffic flow set by multiplying the control variable computed by its parent by a predetermined weight or factor. These weights, provisioned per traffic flow set, enable each child set of a common parent to have a different target occupancy.

The value of the control variable calculated at step (b) depends on the congestion of the present traffic flow set. In the algorithm, congestion is deemed to exist when the actual memory occupancy size exceeds the target memory occupancy size of a given traffic flow set. At each iteration of the algorithm, the value of the control variable is decreased if congestion currently exists and if the traffic flow set previously exhibited congestion. This historical congestion information is preferably based on the last iteration of the algorithm. Conversely, the value of the control variable increases if no congestion exists for the traffic flow set. Thus, in this embodiment, the target occupancies for the child sets of a common parent are based on a disparity between the target and actual memory occupancies of the parent.

Steps (a) and (b) are performed for each traffic flow set at 
a particular level to calculate the respective target occupan- 
cies for the child sets thereof at the next lower level of the 
hierarchy. Another iteration of these steps is performed at the 
next lower level, and so on, until target occupancies are 
calculated for the traffic flows at the bottom-most level of the 
hierarchy. 
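
The recursion over steps (a) and (b) can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the `Partition` class, its field names, and the `recompute` function are hypothetical, and the geometric factors 0.1 and 0.05 are simply the example values given for the pseudo-code constants. For simplicity the sketch compares the current count with the previous count (as the pseudo-code does) to decide whether to decrease the control variable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Partition:
    """One node of the partition hierarchy (hypothetical names)."""
    weight: float = 1.0        # provisioned weight applied to the parent's control variable
    count: int = 0             # measured occupancy, in cells
    last_count: int = 0        # occupancy seen at the previous iteration
    control: float = 1.0       # nominal target size handed down to the children
    lo: float = 0.0            # lower bound on the control variable
    hi: float = float("inf")   # upper bound on the control variable
    target: float = 0.0        # target occupancy computed at step (a)
    children: List["Partition"] = field(default_factory=list)


def recompute(node: Partition, target: float,
              d_down: float = 0.1, d_up: float = 0.05) -> None:
    """Apply steps (a) and (b) to `node`, then recurse into its children."""
    node.target = target                                   # step (a)
    if node.count > target and node.count > node.last_count:
        node.control *= (1.0 - d_down)                     # congested and growing
    elif node.count < target:
        node.control /= (1.0 - d_up)                       # not congested
    node.control = min(max(node.control, node.lo), node.hi)
    node.last_count = node.count
    for child in node.children:
        # Each child's target is the parent's control variable times the child's weight.
        recompute(child, node.control * child.weight, d_down, d_up)
```

Calling `recompute(root, TBS)` on the top-level partition walks the whole tree once; at a leaf the computed `target` plays the role of the discard threshold and the leaf's own `control` value is simply unused.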

For instance, the target occupancy for service classes 16A and 16B is based on a disparity 30 between the target and measured occupancy of shared buffer 14. Similarly, the target occupancy for each VC 25 of service class 16A is based on a disparity 34A between the target and measured occupancy of service class 16A. When an AAL frame or alternatively ATM cell is received, the algorithm identifies the corresponding VC and determines whether its actual memory occupancy exceeds the target memory occupancy size thereof, in which case the frame or cell is subject to discard. In this manner congestion at higher levels of the traffic flow hierarchy percolates through the cascaded hierarchical structure to affect the thresholds of individual connections.

Referring additionally to the pseudo-code, TBS represents the target memory occupancy size for buffer 14. TBS is a fixed value at the highest level. TSCS represents a nominal target size for all service classes 16, and TSCS*Wsc[i] represents the target size for a particular service class. The factor Wsc[i] is the weight applied to a particular service class in order to allow different classes to have various target occupancy sizes. Similarly, TVCS[i] represents a nominal target size for the VCs 25 within a particular service class i, and TVCS[i]*MCR[i][j], which is equal to VCT[i][j], represents the target size, as well as the cell discard threshold, for a particular VC. The factor MCR[i][j] provides MCR proportional distribution of buffer space within a service class. TSCS and the values for each TVCS[i] and VCT[i][j] are periodically computed and thus will generally vary over time.

A variety of counters (B_count, SC_count[i], VC_count[i][j]) are employed to measure the actual memory occupancy size of the various traffic flow sets. These are updated by the QMM 24 whenever a cell is stored or removed from buffer 14. (The updating of counters is not explicitly shown in the pseudo-code.)

Lines 100-102 of the pseudo-code initialize TSCS and TVCS[i] ∀ i. TSCS is initialized to a target size of TBS*FBS. FBS is preferably equal to 1/N, where N is the number of service classes 16 within shared buffer 14. This has the effect of initially apportioning the memory buffer equally amongst each service class. Other initialization values are also possible. TVCS[i] is initialized to 1 for each connection, as a matter of convenience.

Lines 104-116 relate to the periodic calculation of TSCS. Line 104 tests whether the actual occupancy of shared buffer 14 is greater than its target occupancy and is increasing. If so then at line 106 TSCS is geometrically decreased by a factor of 1-D1, where 0<D1<1, e.g., 0.1. Line 108 tests whether the actual occupancy of shared buffer 14 is less than its target size. If so then at line 110 TSCS is geometrically increased by a factor of 1/(1-D2), where 0<D2<1, e.g., 0.05. The values of D1 and D2 are preferably selected such that when the target occupancy decreases it does so at a faster rate than when it increases, as exemplified by the respective values of 0.1 and 0.05. Those skilled in this art will appreciate that D1 and D2 control how fast the system responds to changes of state and that some degree of experimentation in the selection of suitable values for D1 and D2 may be required for each particular application in order to find an optimal or critically damped response time therefor.

Line 114 constrains TSCS to prescribed maximum and minimum limits of TBS and TBS*FBS respectively. The maximum limit prevents service classes from attaining a target occupancy value beyond the availability of the shared buffer. The minimum limit bounds TSCS to ensure that it does not iterate to values that would cause convergence times to suffer.
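
As an illustration of lines 104-116, the periodic TSCS update might be written in Python as below. This is a sketch under stated assumptions: the function name `update_tscs` and its argument order are hypothetical, and the defaults for `d1` and `d2` are just the example values 0.1 and 0.05 mentioned above.

```python
def update_tscs(tscs, b_count, last_b_count, tbs, fbs, d1=0.1, d2=0.05):
    """One periodic pass over pseudo-code lines 104-116.

    Returns the new TSCS and the value to remember as Last_B_count.
    """
    if b_count > tbs and b_count > last_b_count:   # line 104: over target and growing
        tscs = tscs * (1.0 - d1)                   # line 106: geometric decrease
    elif b_count < tbs:                            # line 108: under target
        tscs = tscs / (1.0 - d2)                   # line 110: geometric increase
    tscs = min(max(tscs, tbs * fbs), tbs)          # line 114: TBS*FBS <= TSCS <= TBS
    return tscs, b_count                           # line 116: Last_B_count := B_count
```

With these example constants a sustained overload shrinks TSCS by 10% per pass, while recovery grows it by roughly 5.3% per pass, so the target falls faster than it rises. Note that when the buffer is over target but no longer growing, neither branch fires and TSCS is left unchanged.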

Lines 118-130 relate to the periodic calculation of TVCS[i] in relation to service class i. Line 118 tests whether the actual occupancy size of service class i is greater than its target size and is increasing. If so then at line 120 TVCS[i] is geometrically decreased by a factor of 1-D3, where 0<D3<1, e.g., 0.1. Line 122 tests whether the actual size of service class i is less than its target size. If so then at line 124 TVCS[i] is geometrically increased by a factor of 1/(1-D4), where 0<D4<1, e.g., 0.05. The values of D3 and D4 are preferably selected such that when TVCS[i] decreases it does so at a faster rate than when it increases, as exemplified by the respective values of 0.1 and 0.05.

Line 128 constrains TVCS[i] to prescribed maximum and minimum limits to ensure that convergence times are not excessive. TCSmax is preferably equal to TBS/LR, where LR is the line rate of the corresponding output port. This upper bound also ensures that a connection can never receive more than TBS buffer space. TCSmin is preferably equal to TBS/MCRmin, where MCRmin is the minimum MCR of all connections. This provides a conservative lower bound.

In this embodiment the QMM 24 effects early packet discard (EPD), and thus lines 132-138 are actuated when a start-of-packet (SOP) cell is received by the QMM 24. (In the AAL5 ATM adaptation layer protocol the end of packet (EOP) cell signifies the start of the next packet.) The target memory occupancy size or threshold for VC j of service class i is evaluated at line 132 when a SOP cell is received. The threshold is equal to TVCS[i] multiplied by the MCR of the connection. As mentioned earlier, this provides for MCR proportional distribution of the buffer space allotted to service class i. Line 134 tests whether the number of cells stored for VC j exceeds VCT[i][j], its target occupancy. If so, then EPD is enabled at line 136 and the QMM 24 subsequently discards all cells associated with the AAL5 frame. Lines 132 to 138 are re-executed upon the arrival of the next SOP cell. In the alternative, the system may effect a partial packet discard (PPD) policy. Alternatively still, line 136 may be modified to effect cell discard per se, with lines 132-138 being executed upon the arrival of each cell.
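
The cell-arrival path of lines 132-138 can be sketched as follows. The `VcState` class and `on_cell` function are hypothetical names introduced for illustration; the sketch assumes the caller already knows whether the arriving cell starts a new frame, and it omits the counter decrement that would occur when a cell departs.

```python
class VcState:
    """Per-connection state (hypothetical helper, not from the source)."""

    def __init__(self, mcr):
        self.mcr = mcr            # MCR[i][j], the connection's weight
        self.count = 0            # VC_count[i][j], cells currently stored
        self.discarding = False   # True while EPD is enabled for the current frame


def on_cell(vc, tvcs_i, is_sop):
    """Handle one arriving cell; return True if it is stored, False if discarded."""
    if is_sop:
        vct = tvcs_i * vc.mcr               # line 132: threshold for this VC
        vc.discarding = vc.count > vct      # lines 134-136: enable EPD for the whole frame
    if vc.discarding:
        return False                        # every cell of the frame is dropped
    vc.count += 1                           # cell stored; occupancy counter updated
    return True
```

Because the threshold is evaluated only on the start-of-packet cell, a frame is either accepted or dropped in its entirety, which is the point of early packet discard.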

This embodiment is readily scalable to systems having a large number of service classes and connections since there are relatively few state variables associated with the shared buffer and the service classes. In addition, most computations may be performed in the background, thereby minimizing processing requirements on time critical cell arrival.

This embodiment also allows full buffer sharing. To see why this is so, consider an extreme case where all VCs associated with service class 16B cease transmitting cells. In this case, the shared buffer 14 begins to rapidly empty, causing the measured buffer size to be significantly smaller than the target buffer size. This causes the target sizes for service classes 16A and 16B to increase up to a level of TBS, the target size of the buffer. In turn, TVCS[i] for all connections rises to an amount which enables the service category occupancy to reach TBS. Consequently, the entire buffer becomes available to all of the transmitting connections of service class 16A and full buffer sharing is achieved. Moreover, it will be noted that each VC 25 of service class 16A receives a share of the buffer space allotted to that service class in proportion to the MCR of the connection. Consequently, the instantaneously unused buffer space of service class 16A is distributed in proportion to the MCRs of the connections within the service class.

The method of allocating buffer space has been particu- 
larly described with reference to the three level traffic flow 
hierarchy as shown in FIG. 6. Those skilled in the art will 
understand that the method can be applied with respect to an 
n-level traffic flow hierarchy. 

For example, FIG. 7 shows a four level hierarchy wherein physical memory 12 is partitioned amongst multiple egress ports 20. The level of the port partitions is disposed between the levels for the shared buffer 14 and service classes 16. In this hierarchy, the target memory occupancy size for each port 20 is based on the disparity 30 between the target and measured memory occupancy sizes of shared buffer 14, and the target sizes for the service classes 16 associated with a given port are based on a disparity 32A or 32B between target and measured memory occupancy sizes of the given port. More specifically, let g{x,y} represent a discrete or iterative function wherein if x>y and x is increasing then g{x,y} geometrically decreases and if x<y then g{x,y} geometrically increases. The nominal target occupancy sizes for the various entities in the hierarchy shown in FIG. 3 can be:

TBS = constant,
TPS = g{B_count, TBS},
TSCS[i] = g{P_count[i], wp[i]*TPS},
TVCS[i, j] = g{SC_count[i, j], wsc[i, j]*TSCS[i]}, and
VCT[i, j, k] = TVCS[i, j]*MCR[i, j, k].

In the foregoing, TPS represents a nominal memory occupancy for ports and wp[i] is a weight associated with each port i. The product wp[i]*TPS represents the target size for each particular port, which need not be equal. Similarly, wsc[i, j]*TSCS[i] represents the target size for a particular service class j associated with port i.

It should also be noted that g{x,y} may alternatively provide progressions other than geometric, including but not limited to linear, hyperbolic, logarithmic or decaying exponential progressions. Each type of progression will provide different convergence characteristics. Also, g{x,y} need not necessarily consider historical congestion information.

For example, FIGS. 8-10 show a third embodiment, implemented in hardware, which only considers current congestion. This embodiment is directed toward a buffering subsystem 10 wherein the physical memory 12 is partitioned into a shared memory buffer 14 provisioned for only one of ABR and UBR traffic, or alternatively for traffic from both classes. The remaining portion of the memory 12 may be allocated to other ATM service categories, as described previously, or reserved for over-allocation purposes. FIG. 10 is a tree diagram showing the hierarchical partitionment of the memory for this buffering scheme. Since the subsystem 10 features only one egress port and no partitionment amongst service classes, the memory partitionment and corresponding traffic flow hierarchy only has two levels, namely shared buffer 14 and VCs 25.

FIG. 9 shows hardware 40 incorporated within the QMM of this embodiment for determining whether to enable or disable packet discard. The hardware 40 comprises three inputs as follows:

Qs: A counter in respect of the total number of cells occupying the shared buffer 14, thereby reflecting the actual occupancy size of the shared buffer. This counter is incremented/decremented by the QMM 24 upon cell arrival/departure.

VC_count[j]: A counter in respect of the total number of cells occupied by VC j. This counter is incremented/decremented by the QMM 16 upon the arrival/departure of a cell belonging to VC j.

MCR[j]: The MCR value of VC j.

The QMM 16 utilizes the hardware 40 whenever an end of packet cell (of an AAL frame) arrives, in which case congestion control is executed. The Qs counter or variable is fed to a quantizing function 42 which produces a quantized congestion variable CS_Qs, having a pre-specified range of values, e.g., 0 to 2047 (i.e., an 11 bit quantity). The quantization function maps Qs to CS_Qs based on the line rate of the egress port 20. For example, for a given value of Qs, an egress port having a line rate of 1.6 Mb/s will map onto a lower quantized value CS_Qs than an egress port having a line rate of 800 kb/s. Table 1 below shows an example of this mapping for some common standardized line rates where the pre-provisioned target size for the shared buffer 14 is 32k cells.

TABLE 1

(table body illegible in source)

It should be appreciated that CS_Qs thus corresponds to (text illegible in source) congestion differs depending on the line rate of the port.

The memory occupancy target or threshold, VCT, for a connection j is computed by multiplying the MCR of the connection by a predetermined value selected from a lookup table based on the quantized congestion variable CS_Qs. The lookup table provides in effect pre-computed values of a predetermined function. Table 2 shows an example of such a predetermined function in respect of an OC-12 egress port.

TABLE 2

Decimal Value of CS_Qs (Input)    VCT (Output)
[0, 488]                          (illegible in source)
[489, 1697]                       (illegible in source; expression involving 0.9926094)
[1698, 2047]                      0

This table provides a decreasing function when CS_Qs is in the range of 489-1697; a maximum value when CS_Qs is in the range of 0-488, wherein the shared buffer is relatively uncongested; and a minimum value of 0 when CS_Qs is in the range of 1698-2047, wherein the shared buffer is deemed to be very congested.

When the end of packet cell arrives, a comparator compares the memory occupancy threshold of the VC, i.e., VCT, against VC_count[j], and if the latter is greater than the former, an EPD signal 48 is enabled. Otherwise the (remainder of sentence illegible in source)

2. Extending the Recursive FBA to Enable Random Early Discard

FIGS. 11 and 12 show a fourth embodiment which relates to a single-port ATM buffering subsystem 10 capable of carrying IP traffic. In this embodiment as shown in FIG. 11, the memory 12 is partitioned into a shared memory buffer 14 provisioned specifically for UBR traffic. The remaining portion of the memory 12 may be allocated to other ATM service categories, as described previously, or reserved for over-allocation purposes. Within the UBR shared buffer, adaptive and non-adaptive service classes 16a and 16b are defined and separate queues 17 are provisioned to hold cells in aggregate for the VCs 25 of the corresponding service class. FIG. 12 shows the hierarchical partitionment of the memory 12. In this embodiment, VCs of the adaptive service class carry adaptive IP flows such as TCP flows 50 and VCs of the non-adaptive service class carry non-adaptive flows such as UDP flows 52. It is not, however, necessary for every VC in the UBR class to carry IP-based traffic and a third service class may be defined for such VCs which may, for instance, carry native ATM traffic. In any event, note that the VCs are the most granularly defined traffic flows for the purposes of hierarchically partitioning the memory 12, but that these VCs may carry more granular IP traffic flows.

The algorithm executed by the QMM 24 of this embodiment is substantially identical to that described above in connection with FIGS. 5 and 6 for computing the target memory occupancy sizes or thresholds for service classes and VCs. However, because some of the VCs carry IP-based traffic, the QMM 24 of this embodiment enables the random early discard of IP packets (i.e., AAL5 frames) carried by such VCs in order to improve network performance for IP-based traffic. Additional lines of pseudo-code for enabling the random early discard of packets are shown below. For simplicity, because it is typically unknown a priori which VCs carry adaptive flows and which VCs carry non-adaptive flows, the additional code is executed for all VCs carrying IP-based traffic. Alternatively, the additional code may be selectively executed only in relation to those VCs carrying adaptive IP traffic if that information is available.



ADDITIONAL PSEUDO-CODE

DEFINITIONS:

Per Service Class

Min_th[j] - A variable representing the minimum threshold of permissible memory occupancy for connection j in the UBR (or other) service class, as required by RED.
Max_th[j] - A variable representing the maximum threshold of permissible memory occupancy for connection j in the UBR (or other) service class, as required by RED.
α, β - Constants used to weight the VC threshold in order to compute Min_th and Max_th.

Per Connection

Avg[j] - A variable representing the average memory occupancy size of connection j in the UBR service class.
RED_count[j] - A variable representing the number of IP packets (i.e., AAL5 frames) received until one is dropped for connection j of the UBR service class. This variable is computed based on Avg[j] and a random component R[j], and is decremented until it reaches 0, at which point the incoming IP packet is discarded or dropped for RED purposes.
Temp_RED_count[j] - A temporary variable.
R[j] - A random uniform variable in the range 0 ... 1 used to compute RED_count[j].
P[j] - A variable representing the probability of dropping a packet as a function of Avg[j].
Max_p - A global constant setting an upper bound on any P[j].

PERIODICALLY COMPUTE RED_COUNT[j] (FOR IP-BASED CONNECTIONS):

(140) Min_th[j] := α * VCT[UBR][j]   // where UBR indicates the UBR service class
(142) Max_th[j] := β * VCT[UBR][j]
(144) // calculate Avg[j] as a function of the number of packets
(146) if (Min_th[j] <= Avg[j] <= Max_th[j])
(148)     P[j] := Max_p * (Avg[j] - Min_th[j]) / (Max_th[j] - Min_th[j])
(150)     if (RED_count[j] <= 0)
(152)         select R[j]
(154)         RED_count[j] := ⌈R[j] / P[j]⌉
(156)     end if
(158)     Temp_RED_count[j] := ⌈R[j] / P[j]⌉
(160)     RED_count[j] := min(RED_count[j], Temp_RED_count[j])
(162) else if (Avg[j] < Min_th[j])
(164)     RED_count[j] := -1
(166) end if

UPON PACKET (AAL5 FRAME) ARRIVAL FOR EACH IP-BASED CONNECTION:

(168) if (RED_count[j] > 0)
(170)     RED_count[j] := RED_count[j] - 1
(172) end if
(174) discard packet if RED_count[j] = 0
(176) // note also that packets are discarded once VCT[UBR][j] is exceeded in accordance with pseudo-code lines 132 - 138 discussed above.
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Referring to the additional pseudo-code, lines 140-166 
are executed as a background process in order to periodically compute RED_count[j] for IP-based VCs. The value of this variable indicates which future IP packet should be dropped or discarded early, i.e., even though VCT[UBR][j] for the corresponding connection has not yet been reached. Nevertheless, once the memory occupancy for VC[UBR][j] reaches VCT[UBR][j], packets are dropped.

Lines 140 and 142 calculate the minimum and maximum thresholds for RED purposes in respect of a given VC associated with the UBR (or other) service class. These variables can alternatively be provisioned per service class or for all connections in the system irrespective of the service class. Preferably, however, the minimum and maximum thresholds are not static as in prior art implementations of RED but are based on the dynamic value of VCT[UBR][j].

The factor α is preferably selected such that Min_th[j] represents a state of non-impending congestion in which case the probability of discard should be zero. For instance, if α is set to 0.25, no packets will be discarded if the memory occupancy is less than 25% of the target memory occupancy VCT[UBR][j]. Similarly, β is selected such that Max_th[j] represents a state of impending congestion in which the maximum discard probability Max_p should apply. Max_p and β may each be set to one (1) if desired so that the probability of discard approaches 1 as the target occupancy threshold of the VC is reached.

Line 144 calculates the average memory occupancy size, Avg[j], for connection j of the UBR service class. A moving average computation, as known in the art per se, is preferably employed for this purpose. In the alternative, the current memory occupancy size of the connection may be used to minimize computational complexity.

At line 146 Avg[j] is tested to see if it falls within the range defined by the minimum and maximum thresholds for the connection. If so, then congestion is "anticipated" and the drop probability P[j] is computed as disclosed in the pseudo-code based on how close Avg[j] is to the maximum threshold. At line 150 RED_count[j], which represents the number of packets that may be received until one is randomly dropped, is tested to see if it is less than or equal to zero. If so, this indicates that a packet must be dropped for RED purposes. Accordingly at line 152 a new value for the random uniform variable R[j] is selected and at line 154 a new value for RED_count[j] is computed.

At line 158 a parallel value for Red_count[j] is computed
based on P[j] as computed at line 148. At line 160 the
algorithm selects the minimum of the current value of
Red_count[j] and the parallel value for it. These steps are
present because this part of the algorithm runs as a background
process and may be invoked asynchronously of lines
168-174, which are triggered upon the arrival of an IP
packet. Thus, in the absence of lines 158-160, Red_count[j]
may be set to a higher value before a packet has been
dropped in connection with the value of Red_count[j] as
computed in respect of the previous iteration of this background
process. At the same time, if congestion increases,
the probability of dropping a packet should increase, causing
Red_count[j] to decrease in value in the present iteration of
the process. If this is the case, it is preferred to immediately
capture the new value of Red_count[j].
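
The effect of taking the minimum can be seen in the following sketch. The countdown formula int(r / p) is an assumption carried over from the Floyd and Jacobson method, and the function names countdown and background_pass are hypothetical.

```python
def countdown(r, p):
    """Assumed form of the countdown: int(r / p) packets until the next drop."""
    return int(r / p)


def background_pass(red_count, r, p):
    """Return the countdown to keep after one background pass.

    The parallel value reflects the current drop probability p. Taking the
    minimum keeps a pending drop from being postponed and lets increased
    congestion shorten the countdown immediately.
    """
    parallel = countdown(r, p)
    return min(red_count, parallel)


# Low drop probability: the parallel value is larger, so the pending
# countdown of 5 is kept.
kept = background_pass(red_count=5, r=0.5, p=0.05)
# High drop probability: the parallel value is smaller and replaces it.
shortened = background_pass(red_count=5, r=0.5, p=0.5)
```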

At lines 162-164 Red_count[j] is set to -1 in the event
Avg[j] is less than the minimum threshold for that connection.

Lines 168-174 are triggered upon the arrival of an IP
packet (or AAL5 frame). If Red_count[j] is greater than
zero it is decremented at line 170. If at line 174 Red_count[j]
is equal to zero for the corresponding VC then the packet
is discarded for RED purposes. Note also that if the memory
occupancy size of that VC is greater than VCT[UBR][j] then
the packet will be discarded in accordance with the EPD
criteria provided by lines 132-138.
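
A minimal sketch of the per-packet decision follows. The function name on_packet_arrival and the dictionary layout are hypothetical. Two points are assumptions rather than statements of the pseudo-code: the EPD test is evaluated before the RED test, and the countdown is reset to -1 after a RED discard so that further packets are accepted until the background process computes a new value.

```python
def on_packet_arrival(conn, vct):
    """Classify an arriving packet as 'epd', 'red' or 'accept'.

    conn -- dict with 'occupancy' (current memory occupancy of the VC) and
            'red_count' (countdown maintained by the background process)
    vct  -- current target memory occupancy of the VC
    """
    if conn["occupancy"] > vct:
        return "epd"            # over target: early packet discard applies
    if conn["red_count"] > 0:
        conn["red_count"] -= 1  # one fewer packet until the next random drop
    if conn["red_count"] == 0:
        conn["red_count"] = -1  # assumed reset; the background process renews it
        return "red"
    return "accept"
```

With a countdown of 2, the first arriving packet is accepted and the second is discarded for RED purposes; subsequent packets are accepted until the countdown is renewed.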

The present embodiment does not suffer from the drawbacks
associated with the use of RED in networks featuring
adaptive and non-adaptive flows such as the TCP-carrying
VCs and UDP-carrying VCs hereof. The primary reason for
this is that the recursive FBA scheme fairly apportions the
buffer space between adaptive flows and non-adaptive flows
at the service class level. This can be seen in the example
shown in FIGS. 13A-13C where, for simplicity, only one
VC 25x or 25y is present in each service class, as shown.
FIG. 13A illustrates an initial condition wherein the VCs 25x
and 25y of the adaptive and non-adaptive service classes 16a
and 16b are transmitting packets at equal rates Φ. In this
condition, the service rate is twice the VC transmission rate
and the system is in equilibrium in that the memory occupancy
size of each element (VC, service class and shared
buffer) is equal to its target memory occupancy size. At some
point t1 a packet is randomly dropped from VC 25x, such
that the source decreases its transmission rate by half (Φ/2)
in accordance with the TCP/IP protocol. A short time t2 later,
as shown in FIG. 13B, the actual size of the shared buffer
14 (represented by stippled lines) becomes smaller than its
target size (represented by solid lines) such that disparity 30
exists. Under the preferred recursive FBA, this will cause
the target sizes (represented by solid lines) of service classes
16a and 16b and VCs 25x and 25y to increase. If a short time
t3 later the non-adaptive VC 25y increases its transmission
rate to Φ++, the system is able to temporarily buffer the
excess packets corresponding to the differential in transmission
rates (i.e., Φ++ − Φ) since the target sizes of VC 25y
and service class 16b have increased and the shared buffer
14 is not yet congested. The system will allow full buffer
sharing as discussed above. However, as the source of the
adaptive VC 25x increases its transmission rate back to Φ,
the shared buffer 14 begins to fill up, resulting in the target
sizes of service classes 16a, 16b and VCs 25x and 25y
returning to their initial states, as shown in FIG. 13C.
Consequently, packets will be discarded from the non-adaptive
VC 25y since its transmission rate is greater than its
initial rate Φ. Thus, while the non-adaptive VC can take
advantage of idle buffer space during the transient period
required for the source of the adaptive VC 25x to return to its
nominal transmission rate, under steady state conditions the
non-adaptive VC 25y cannot monopolize the buffer even
though random early discard has been applied.

Those skilled in the art will recognize that while portions
of the above pseudo-code are similar to the RED algorithm
described by Floyd and Jacobson, supra, the invention may
alternatively employ other variants of RED. These include
the Adaptive Random Early Detection Algorithm (ARED),
proposed in Feng, W., Kandlur, D., Saha, D., and Shin, K.,
Techniques for Eliminating Packet Loss in Congested TCP/IP
Networks, unpublished, and the Random Early Drop with
Ins and Outs (RIO) described in Clark and Fang, Explicit
Allocation of Best Effort Packet Delivery Service, IEEE/ACM
Transactions on Networking, vol. 6, no. 4, August 1998.

Similarly, those skilled in the art will understand that the
above algorithm may be modified so that RED or one of its
variants may be applied at the service class level or any other
level where adaptive and non-adaptive traffic are distinguished.
This is possible whether the VC level is present in
the hierarchy or not. For example, FIGS. 14 and 15 show a
fifth embodiment which relates to a single port IP router
buffering system 10. In this embodiment, as shown in FIG.
14, the memory 12 is partitioned into a shared memory
buffer 14 which is further partitioned into service class
queues 17a and 17b for adaptive and non-adaptive flows 50
and 52 contained within input stream 18. FIG. 15 shows the
hierarchical partitionment of the memory 12. In this
embodiment, there are only two such levels, namely the
shared buffer and service classes, and the target memory
occupancy sizes for these traffic flow sets are as follows:

TBS = constant

TSCS = g{B_count, TBS}

SCT[i] = w_s[i]*TSCS

where TBS represents the target occupancy of the shared
buffer 14;

B_count is a count of the memory occupancy size of the
shared buffer 14;

w_s[i] is a weight provisioned per service class;

g{B_count, TBS} is an iterative function providing a
predetermined progression based on a disparity
between the actual memory occupancy size of the
shared buffer 14 and the target occupancy thereof;

TSCS represents the nominal target size for each service
class 16; and

SCT[i] represents the weighted threshold for each service
class.

In addition, pseudo-code lines 140-174 for providing
RED-like functionality are modified for this embodiment by
setting Min_th[i] = α*SCT[i] and Max_th[i] = β*SCT[i]. Lines
168-174 are triggered whenever an IP packet arrives.
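
For illustration, the two-level computation of this embodiment may be sketched as follows. The progression g shown here, a simple multiplicative step, is purely hypothetical, since the text requires only a predetermined progression driven by the disparity between B_count and TBS. The function service_class_thresholds, the class names and the weights are likewise assumptions.

```python
def g(tscs, b_count, tbs, step=1.1):
    """Hypothetical progression for the nominal target size TSCS."""
    if b_count < tbs:
        return tscs * step   # buffer under target: let targets grow
    if b_count > tbs:
        return tscs / step   # buffer over target: pull targets back
    return tscs


def service_class_thresholds(tscs, weights, alpha=0.25, beta=1.0):
    """Return {class: (SCT, Min_th, Max_th)} for weighted service classes."""
    out = {}
    for name, w in weights.items():
        sct = w * tscs                        # SCT[i] = weight * TSCS
        out[name] = (sct, alpha * sct, beta * sct)
    return out


tscs = g(100.0, b_count=50, tbs=200)          # buffer is under target
thresholds = service_class_thresholds(
    100.0, {"adaptive": 1.0, "non_adaptive": 0.5}
)
```

Each iteration of g moves TSCS in the direction that reduces the disparity, and the RED thresholds of every service class follow automatically through SCT[i].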

3. Alternative FBAs 

The foregoing has described the application of random
early detection to the preferred recursive FBA. However,
other dynamic FBA schemes may be used in the alternative,
including:

Choudhury and Hahne, "Dynamic Queue Length Thresholds
in a Shared Memory ATM Switch", ©1996 IEEE,
Ref. No. 0743-166X/96; and

Guerin et al., "Scalable QoS Provision Through Buffer
Management", Proceedings of the ACM SIGCOMM, Vancouver,
September 1998, all of which are incorporated herein by
reference.

In each of the foregoing schemes a target or threshold size
is established for a particular type of traffic flow. For
instance, the Choudhury and Hahne scheme may be used to
dynamically establish a threshold memory occupancy size
for VCs in an ATM switch. The network may be configured
so that VCs carry either adaptive or non-adaptive IP flows,
but not both. Once the different types of flows are logically
isolated, the pseudo-code described in lines 140-176 may be
employed to apply random early detection in accordance
with the invention. In this case, the Min_th and the Max_th
thresholds computed in lines 140 and 142 are based on the
VC thresholds computed by the Choudhury et al. reference.
The drawbacks of the prior art associated with the use of
RED are also avoided by this embodiment since the FBA
scheme ensures that the non-adaptive VCs do not monopolize
the shared buffer. However, because the Choudhury et al.
FBA scheme reserves buffer space to prevent overflows,
this embodiment does not allow for full buffer sharing, in
contrast to the preferred FBA scheme.
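
A sketch of this alternative follows. It assumes the dynamic threshold form of the Choudhury and Hahne scheme, in which the per-queue threshold is proportional to the unused buffer space. The function names, and the parameter dt_alpha (the proportionality constant of that scheme, distinct from the RED factor α), are hypothetical.

```python
def dynamic_threshold(buffer_size, total_occupancy, dt_alpha=1.0):
    """Threshold proportional to the unused space in the shared buffer."""
    return dt_alpha * max(0, buffer_size - total_occupancy)


def red_thresholds_from_dt(buffer_size, total_occupancy,
                           alpha=0.25, beta=1.0, dt_alpha=1.0):
    """Derive (Min_th, Max_th) from the dynamic VC threshold."""
    t = dynamic_threshold(buffer_size, total_occupancy, dt_alpha)
    return alpha * t, beta * t
```

Because the threshold shrinks as the buffer fills, some space always remains unallocated, which is why this alternative does not provide full buffer sharing.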

Those skilled in the art will understand that while the
embodiments described herein have disclosed two, three and
four level memory partition/traffic flow hierarchies, far more
elaborate hierarchies may be constructed. Other possible
hierarchies specific to the ATM environment include (from
top level to bottom level):

buffer, port, service category, groups of virtual circuits,
individual virtual circuits;

buffer, port, service category, queue, virtual circuit;

buffer, port, service category, virtual path aggregation
(VPA), and virtual circuit;

buffer, port, service category, virtual private network
(VPN), and virtual circuit;

buffer, port, service category, VPN, VPA, and virtual
circuit;

buffer, port, service category, aggregation of VCs
(alternatively referred to as a VC merge);

buffer, port, service category.

Similarly, those skilled in the art will appreciate that
numerous modifications and variations may be made to the
preferred embodiment without departing from the spirit of
the invention.

What is claimed is: 

1. A method of processing packets at a queuing point in 
a communications device, the method comprising: 

receiving and associating packets with one of a plurality
of traffic flow sets each said set comprising one of
adaptive traffic flows and non-adaptive traffic flows;

dynamically computing a target memory occupancy size
for each said traffic flow set in accordance with a
pre-determined dynamic fair buffer allocation scheme;

discarding packets associated with any of said traffic flow 
sets in the event the set is in a congested state; and 

prior to discarding packets due to congestion, discarding
packets associated with the traffic flow sets containing
adaptive traffic flows according to a dynamically computed
probability of packet discard, wherein the probability
of packet discard of any such traffic flow set is
related to the target memory occupancy size thereof.

2. The method according to claim 1, wherein packets are
discarded from all said traffic flow sets.

3. The method according to claim 1, including measuring 
the actual memory occupancy size of each said traffic flow 
set, and wherein a traffic flow set is congested when the 
actual memory occupancy size thereof reaches or exceeds its 
target memory occupancy size. 

4. The method according to claim 1, wherein the prob- 
ability of packet discard for a given traffic flow set is zero if 
the target memory occupancy size thereof is below a thresh- 
old value. 

5. A method of buffering packets in a communications 
device, the method comprising: 

defining a hierarchy of traffic flow sets the hierarchy 
including at least a top level and a bottom level, 
wherein each non-bottom level traffic flow set com- 
prises one or more child traffic flow subsets and 
wherein at one said non-bottom hierarchical level each 
said set within a group of traffic flow sets comprises one 
of adaptive or non-adaptive traffic flows; 

provisioning a target memory occupancy size for each 
top-level traffic flow set; 

dynamically determining a target memory occupancy size
for each traffic flow set having a parent traffic flow set
based on a congestion measure of the parent traffic flow
set;

measuring the actual amount of memory occupied by the
packets associated with each bottom level traffic flow
set;

enabling the discard of packets associated with a given
bottom level traffic flow set in the event the actual
memory occupancy size of the corresponding bottom
level traffic flow reaches or exceeds the target memory
occupancy size thereof to thereby relieve congestion;

enabling packets associated with the traffic flow sets
comprising the adaptive traffic flows to be randomly
discarded prior to the step of discarding packets for
relieving congestion.

6. The method according to claim 5, wherein packets are
randomly discarded from the traffic flow sets comprising
adaptive flows and traffic flow sets comprising non-adaptive
flows.

7. The method according to claim 5, wherein the probability
of discarding a packet associated with a given traffic
flow set at said pre-selected level of said hierarchy is based
on (i) the target memory occupancy size of said given traffic
flow set, or, (ii) the target size of a descendant traffic flow set
thereof which is also associated with said packet.

8. The method according to claim 7, wherein the probability
of discard is zero if the actual memory occupancy
size of the corresponding traffic flow set is less than a
predetermined percentage of its target memory occupancy size.

9. The method according to claim 7, wherein each non-top
level traffic flow set is a subset of a traffic flow set located
on an immediately higher level of the hierarchy and each
non-bottom level traffic flow set is a superset of at least one
traffic flow set located on an immediately lower level of the
hierarchy.

10. The method according to claim 9, including measuring
the actual amount of memory occupied by each traffic
flow set of the hierarchy.

11. The method according to claim 9, wherein said step of
determining a target size includes computing a nominal
target occupancy size for all the child traffic flow sets of a
common parent and provisioning each such child traffic flow
set with a weight, wherein the target memory occupancy size
of each such child traffic flow set is a weighted amount of the
nominal target occupancy size.

12. The method according to claim 11, wherein the
nominal target occupancy size for a group of child traffic
flow sets having a common parent changes in accordance
with a prespecified function in response to the congestion of
the common parent traffic flow set.

13. The method according to claim 12, wherein congestion
is correlated to a disparity between the target and
measured memory occupancy sizes of a traffic flow set.

14. The method according to claim 9, wherein the target
memory occupancy size for each non-top level traffic flow
set changes in accordance with a prespecified function in
response to a disparity between the target and measured
memory occupancy sizes of its parent traffic flow set.

15. The method according to claim 9, wherein a bottom
level traffic flow set comprises an individual traffic flow
selected from the group consisting of: a virtual connection;
a label switched path; and a logical stream of packets
resulting from the forwarding rules of a packet classifier.

16. The method according to claim 5, wherein said packet
is one of: an IP packet, an AAL frame, and an ATM cell.

17. The method according to claim 4, wherein said packet
is one of: an IP packet, an AAL frame, and an ATM cell.

18. The method according to claim 17, wherein congestion
is correlated to a disparity between the target memory
occupancy size and a measured memory occupancy size of
a traffic flow set.



